
Adds lm_eval to evaluations #282


Open

wants to merge 48 commits into main

Conversation


@bigximik bigximik commented Jun 2, 2025

✨ Description

Add lm_eval integration which enables running evaluation directly from an in-memory model. Currently, only data parallelism is supported.

Closes #199

🔍 Type of change

Select all that apply:

  • πŸ› Bug fix (non-breaking change that addresses a specific issue)
  • πŸš€ New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • πŸ“ˆ Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • πŸ› οΈ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • πŸ“¦ Dependency bump (updates dependencies, including Dockerfile or package changes)
  • πŸ“ Documentation change (updates documentation, including new content or typo fixes)
  • πŸ”§ Infrastructure/Build change (affects build process, CI/CD, or dependencies)

πŸ“ Changes

List the key changes introduced in this PR:

  1. Improved Weights & Biases (wandb) Logging
    The final log for each step now explicitly commits the data, ensuring that the step is immediately visible in the wandb interface. Previously, a step would only appear after logging began for the following step.

  2. Integration of lm_eval Evaluator
    Added support for lm_eval as a built-in evaluation method. This enables running standard language model benchmarks during or after training using the Evaluation Harness (see the sketch after this list).

  3. Extended Support for Distributed Primitives
    Introduced custom distributed utilities, adapted from PyTorch, that support the non-standard process groups required by the lm_eval integration.

  4. Added User Guide for Evaluations
    Included documentation explaining how to configure and use evaluators such as loss and lm_eval, both during training and in standalone evaluation mode.
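
To make the integration concrete, here is a minimal sketch of how an in-memory model is typically exposed to the Evaluation Harness through lm_eval's 0.4.x Python API. The class name, constructor arguments, and method bodies are illustrative assumptions; the actual Fast-LLM wrapper in this PR may be structured differently.

```python
# Hypothetical sketch only: not the Fast-LLM wrapper from this PR.
import lm_eval
from lm_eval.api.model import LM


class InMemoryModelWrapper(LM):
    """Adapter exposing an already-loaded model to the Evaluation Harness."""

    def __init__(self, model, tokenizer):
        super().__init__()
        self._model = model
        self._tokenizer = tokenizer

    def loglikelihood(self, requests):
        # Return (log-likelihood, is_greedy) per request; see the
        # _loglikelihood_tokens discussion further down in this thread.
        raise NotImplementedError

    def loglikelihood_rolling(self, requests):
        raise NotImplementedError

    def generate_until(self, requests):
        raise NotImplementedError


# Benchmarks then run through the standard entry point, e.g.:
# results = lm_eval.simple_evaluate(
#     model=InMemoryModelWrapper(model, tokenizer),
#     tasks=["wikitext"],
# )
```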

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Base automatically changed from denis/evaluate to main June 19, 2025 15:44
@bigximik bigximik changed the title from "[work in progress] Adds lm_eval to evaluations" to "Adds lm_eval to evaluations" on Jun 30, 2025
@bigximik bigximik marked this pull request as ready for review June 30, 2025 15:17
@bigximik bigximik requested a review from jlamypoirier June 30, 2025 15:17
Contributor Author

bigximik commented Jun 30, 2025

Log-likelihood calculation in the lm_eval wrapper is slow because post-processing of the logits is performed on the CPU (see the TODOs and NOTEs in the code). However, it is still faster than generation without a KV cache. I propose optimizing it in a separate PR once the KV cache is implemented.

" passed to the Fast-LLM lm_eval model wrapper.",
)

add_bos_token: bool = Field(
Collaborator

Could this be determined by the model itself?

Contributor Author

As I understand from comments across the code, some models simply underperform without it, but by default it is off for all models.

print(f"Determined largest batch size: {self.batch_sizes[sched]}")
return self.batch_sizes[sched]

def _loglikelihood_tokens(
Collaborator

What does this do?

Contributor Author

It computes per-token log-likelihoods, which are used by lm_eval to calculate various evaluation metrics. The function applies softmax to the returned logits to obtain token-level probabilities. Currently, the logits are moved to the CPU for post-processing, which can be slow but helps avoid GPU memory pressure.

There is a TODO to distribute this function and return only per-token log-probabilities from each shard instead of gathering all logits centrally. However, I prefer to implement that in a separate PR.
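
For context, the computation described above looks roughly like the following sketch. It is illustrative only, with assumed tensor shapes; the actual wrapper differs in batching and detail.

```python
import torch
import torch.nn.functional as F


def continuation_logprobs(logits: torch.Tensor, continuation: torch.Tensor) -> torch.Tensor:
    """Score a continuation given full-sequence logits (illustrative sketch).

    logits: [seq_len, vocab_size] outputs for context + continuation tokens.
    continuation: [cont_len] token ids whose log-likelihood is needed.
    """
    # Post-processing on CPU, as in the current wrapper: slower, but avoids
    # extra GPU memory pressure.
    log_probs = F.log_softmax(logits.float().cpu(), dim=-1)
    cont_len = continuation.shape[0]
    # logits[i] predicts token i + 1, so the predictions for the continuation
    # are the last cont_len positions before the final one.
    preds = log_probs[-cont_len - 1 : -1]
    return preds.gather(-1, continuation.cpu().long().unsqueeze(-1)).squeeze(-1)
```

Summing the returned values gives the sequence-level log-likelihood that lm_eval consumes for its metrics.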

# Note: metrics modified in-place
if self._wandb is not None:
    import wandb

-   wandb.log(metrics, step=completed_steps)  # noqa
+   wandb.log(metrics, step=completed_steps, commit=commit)  # noqa
Collaborator

Why do we want this?

Contributor Author

This approach allows the code to log multiple times for the same step. When the final log for that step is recorded, all previous logs for that step become immediately visible in Weights & Biases (wandb). There's no longer a need to wait for the next step to start logging for the previous step's logs to appear. Previously, logs for a step would remain hidden until logging began for the next step.
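
As a minimal illustration of this pattern with wandb's standard API (metric names and values here are made up):

```python
import wandb

run = wandb.init(project="demo")  # illustrative setup
completed_steps = 10

# Intermediate logs for a step are buffered with commit=False ...
wandb.log({"train/loss": 0.42}, step=completed_steps, commit=False)
wandb.log({"eval/lm_eval/wikitext": 18.7}, step=completed_steps, commit=False)

# ... and the final log for the step commits it, so everything logged for
# this step shows up in the wandb UI right away instead of only once the
# next step starts logging.
wandb.log({"train/learning_rate": 3e-4}, step=completed_steps, commit=True)
```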

@bigximik bigximik requested a review from tscholak July 8, 2025 10:06
Collaborator

@jlamypoirier jlamypoirier left a comment

It's getting close, but the lack of tests will be a problem. Do you intend to add some?

Collaborator

@jlamypoirier jlamypoirier left a comment

The main code looks ready, but I have some concerns with the tests. If we want to merge now I suggest moving the test changes to another PR, then I can approve right away.

]


@pytest.mark.extra_slow
Collaborator

How long does this take? It would be worrying not to have any tests other than extra-slow.

Contributor Author

Very long: 40-80 seconds per test.



@pytest.fixture(scope="function")
def get_lm_eval_config(tokenizer_path, monkeypatch):
Collaborator

I managed to reduce the test time by trimming bloat in lm_eval, reducing the evaluation size, and restricting the task to wikitext (the others are much slower). That should still be enough since we're only testing Fast-LLM's wrapper, not lm_eval itself. It's now below 10 seconds, so we no longer need to mark it as extra_slow.

Contributor Author

cool, thanks
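
For reference, restricting the benchmark to a single fast task and a small sample count looks roughly like this in lm_eval's own Python API. This is a sketch, not the actual test fixture from this PR; the model, task, and limit values are illustrative.

```python
import lm_eval

# A single small task plus a low sample limit keeps wrapper-level tests fast;
# "wikitext", limit=8, and the tiny HF model are illustrative choices.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-14m",
    tasks=["wikitext"],
    limit=8,
)
print(results["results"]["wikitext"])
```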

run_test_script_for_all_models(
get_lm_eval_config(run_test_script_base_path / "test_lm_eval_in_training"),
compare="test_lm_eval_in_training",
prepare_fn=_prepare_resume_fn,
Collaborator

This doesn't make sense: we're starting an evaluation, not resuming training, so we should use the checkpoint as a pretrained model.

Contributor Author

Okay, I created a function that simply copies the training run with a checkpoint, but the evaluation still starts from the training config. I haven't tested it starting from a pretrained checkpoint yet; maybe we should leave that for another PR?

cartesia_pytorch>=0.0.2

GENERATION =
lm_eval>=0.4.9
Collaborator

Added a dependency; packages we use should be declared in setup.

Contributor Author

Maybe the section should be named 'evaluations', since it's not specifically for generation?

@@ -392,3 +394,23 @@ def enabled(self) -> bool:
@property
def interrupted(self):
return self._interrupted


def set_global_variables(disable_torch_dynamo: bool = False) -> None:
Collaborator

Made this into a separate method so we can call it in more places, e.g. conftest and CLI code paths that don't go through Run.
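
A rough sketch of what such a helper might look like; the environment variables below are assumptions chosen for illustration, not the PR's actual implementation.

```python
import os


def set_global_variables(disable_torch_dynamo: bool = False) -> None:
    """Illustrative sketch of a process-wide setup hook (not the PR's code)."""
    # Example of a global knob commonly set before heavy imports (assumption).
    os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
    if disable_torch_dynamo:
        # Recognized by PyTorch; turns TorchDynamo / torch.compile into a no-op.
        os.environ["TORCHDYNAMO_DISABLE"] = "1"
```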

Collaborator

@jlamypoirier jlamypoirier left a comment

Should be ready to merge once the remaining comment is addressed and the tests pass.

Development

Successfully merging this pull request may close these issues.

Run lm-eval-harness benchmarks during validation